For this section, we will visualize each dataset outlined in the Data Sources tab, allowing us to better understand the nature of each area related to public transit usage. This will be helpful for identifying trends, seasonality, and cycles, each of which will be further analyzed in the EDA section.
library(plotly)fig <- ridership %>%plot_ly(x =~`Quarter-Year`,y =~`Total Ridership (000s)`,type ='scatter',mode ='lines',showlegend = F )%>% plotly::layout(title =list(text="Quarterly Total Public Transit Ridership"),xaxis =list(title="Date"),yaxis =list(title="Public Transit Trips (000s)"))fig
This graph gives a good general overview of public transit usage before, during, and after the pandemic. Usage appeared to be moving relatively positively, although there is a noticeable decline even before 2020 which indicates that a reduction may have been occurring independent of COVID-19. However, this pales in comparison to the drastic drop in 2020 and subsequent rebound, of which we still likely haven’t seen the end. Finally, there may be a seasonality aspect, as Q1 always appears to be the lowest value in its cluster.
Code
library(dplyr)recent_ridership <- ridership %>%filter(Year >2015)fig <-plot_ly(recent_ridership, x =~`Quarter-Year`, y =~`Heavy Rail (000s)`, type ='bar', name ="Heavy Rail") %>%add_trace(y =~`Light Rail (000s)`, name ="Light Rail") %>%add_trace(y =~`Commuter Rail (000s)`, name ="Commuter Rail") %>%add_trace(y =~`Trolleybus (000s)`, name ="Trolleybus") %>%add_trace(y =~`Bus (000s)`, name ="Bus") %>%add_trace(y =~`Demand Response (000s)`, name ="Demand Response") %>%add_trace(y =~`Other (000s)`, name ="Other") %>%layout(yaxis =list(title ='Public Transit Trips (000s)'), barmode ='stack',xaxis =list(title ='Date'),title =list(text ='Ridership by Transit Type since 2015'))fig
This stacked bar chart simply shows the share that each transit type has in contributing to the first graph. The main takeaways are that heavy rails and buses are by far the most used types, and that as total usage increases/decreases, each transit type’s usage increases/decreases accordingly. There does not appear to be a type that thrives or diminishes at an opposite rate to the others.
Code
#install.packages("ggcorrplot")library(ggcorrplot)library(ggplot2)corr_matrix <-cor(ridership[c(5:11)], use ="complete.obs")ggcorrplot(corr_matrix, method ="square",type ="full",lab =TRUE) +ggtitle("Correlation Matrix of Public Transit Types") +theme(plot.title =element_text(hjust =0.5))
This shows what we surmised in the previous graph, that transit types are mostly positively correlated with one another, and heavily so. Buses and trolleybuses appear to be less correlated with the rest, but it is reasonable to generalize that as one transit type’s usage increases, so will the rest.
mta <-read_csv("../code/MTA_ridership.csv")mta$Date<-as.Date(mta$Date,"%m/%d/%Y")mta <- mta[order(mta$Date),]head(mta)
Code
mta$total_ridership <- mta$`Subways: Total Estimated Ridership`+ mta$`Buses: Total Estimated Ridership`+ mta$`LIRR: Total Estimated Ridership`+ mta$`Metro-North: Total Estimated Ridership`+ mta$`Staten Island Railway: Total Estimated Ridership`mta$weekly_ma <-ma(mta$`total_ridership`, 7)
Code
fig <- mta %>%plot_ly(x =~Date,y =~`total_ridership`,type ='scatter',mode ='lines',name ="Daily Ridership",showlegend = T )%>%add_lines(y =~weekly_ma, name ="7-day Moving Average", line =list(color ='red')) %>%layout(title =list(text="MTA: Daily Total Estimated Ridership"),xaxis =list(title="Date"),yaxis =list(title="Total Daily Public Transit Trips"))fig
This gives a more granular overview of the usage decline and rebound in New York City, as we can see the daily values. The moving average is easier to decipher, and shows a trend similar to the general data for all of the United States. However, it is possible that the recovery is tapering off a bit. One thing to note is that there are drastically different values even within a small date range, suggesting that there is a better way to split up this data. Thus, the next graph will split by weekday and weekend.
Code
mta$day <-weekdays(as.Date(mta$Date))mta <- mta %>%mutate(day_type =case_when( day %in%c("Saturday", "Sunday") ~"Weekend",TRUE~"Weekday" ))head(mta)
Code
fig <- mta %>%plot_ly(x =~Date,y =~`total_ridership`,group =~day_type,color =~day_type,type ='scatter',mode ='lines',showlegend = T )%>%layout(title =list(text="MTA: Daily Total Estimated Ridership (Split by Weekends and Weekdays)"),xaxis =list(title="Date"),yaxis =list(title="Total Daily Public Transit Trips"))
Warning in plot_ly(., x = ~Date, y = ~total_ridership, group = ~day_type, : The group argument has been deprecated. Use `group_by()` or split instead.
See `help('plotly_data')` for examples
Code
fig
Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
This confirms the suspicion that a great deal of the variation is simply due to the day of week. Additionally, many of the drops in the weekdays line come on federal holidays. With working weekdays being significantly higher in ridership, we can presume that a great deal of usage comes from the need to commute to work and school, at least in New York City.
fig <- cta %>%plot_ly(x =~Date,y =~`total_rides`,type ='scatter',mode ='lines',name ="Daily Ridership",showlegend = T )%>%add_lines(y =~weekly_ma, name ="30-day Moving Average", line =list(color ='red')) %>%layout(title =list(text="CTA: Daily Total Estimated Ridership"),xaxis =list(title="Date"),yaxis =list(title="Total Daily Public Transit Trips"))fig
In Chicago we see an unmistakable steady decline in usage prior to the pandemic. The recovery appears to be somewhat slow, which is certainly something to analyze in future sections. It is possible that some of the struggles go beyond the impact of the pandemic.
Code
fig <- cta %>%plot_ly(x =~Date,y =~`total_rides`,group =~type,color =~type,type ='scatter',mode ='lines',showlegend = T )%>%layout(title =list(text="CTA: Daily Total Estimated Ridership (Split by Weekends and Weekdays)"),xaxis =list(title="Date"),yaxis =list(title="Total Daily Public Transit Trips"))
Warning in plot_ly(., x = ~Date, y = ~total_rides, group = ~type, color = ~type, : The group argument has been deprecated. Use `group_by()` or split instead.
See `help('plotly_data')` for examples
Code
fig
Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Finally, this tells the same story as the split graph in New York City. Weekdays consistently produce higher ridership, indicating that commutes to work and school represent a great share of total usage.
Monthly Ridership
To get a more specific view of how and why public transit ridership trends occur, we will now take a look at data that breaks it down by city, accounting for all metropolitan areas in the U.S. where applicable. The following plot simply shows total numbers by city for the most recent time step to show us which cities are most impacted by the industry. We will then use this data to decide where to focus our analysis.
upt_subset <- cities_agg %>% dplyr::select(`UZA Name`, `Feb-25`) %>% dplyr::rename(upt =`Feb-25`)vrm_subset <- vrm_agg %>% dplyr::select(`UZA Name`, `Feb-25`) %>% dplyr::rename(vrm =`Feb-25`)# Join them by Citycombined_df <-inner_join(upt_subset, vrm_subset, by ="UZA Name")combined_df$upt <-log(combined_df$upt)combined_df$vrm <-log(combined_df$vrm)# Create scatter plotplot_ly(combined_df, x =~upt, y =~vrm, type ='scatter', mode ='markers',text =~`UZA Name`,marker =list(size =10)) %>%layout(title ="City Comparison for February 2025",xaxis =list(title ="Log(Unlinked Passenger Trips)"),yaxis =list(title ="Log(Vehicle Ridership Miles)"))
This plot shows that vehicle miles and total passenger trips are heavily correlated, which is unsurprising. Due to the gap that occurs after the top nine cities in both categories, our city-by-city analyses will focus only on these nine cities:
New York, NY
Los Angeles, CA
Chicago, IL
Washington, DC
San Francisco, CA
Philadelphia, PA
Boston, MA
Seattle, WA
Miami, FL
Interestingly, this is not the list of the most populous cities in the U.S., indicating that there are great disparities in public transit usage beyond just volume or population density.
combined_wide$Month <-as.Date(paste0(combined_wide$Month, "-01"), format ="%b-%y-%d")write_csv(combined_wide, "../data/monthly_data.csv")
Code
plot <-plot_ly(combined_wide, x =~Month) %>%add_lines(y =~`UPT - New York--Jersey City--Newark, NY--NJ`, name ="New York, NY", line =list(color ='darkgreen')) %>%add_lines(y =~`UPT - Los Angeles--Long Beach--Anaheim, CA`, name ="Los Angeles, CA", line =list(color ='orange')) %>%add_lines(y =~`UPT - Chicago, IL--IN`, name ="Chicago, IL", line =list(color ='#cccc00')) %>%add_lines(y =~`UPT - Washington--Arlington, DC--VA--MD`, name ="Washington, DC", line =list(color ='green')) %>%add_lines(y =~`UPT - San Francisco--Oakland, CA`, name ="San Francisco", line =list(color ='blue')) %>%add_lines(y =~`UPT - Philadelphia, PA--NJ--DE--MD`, name ="Philadelphia, PA", line =list(color ='purple')) %>%add_lines(y =~`UPT - Boston, MA--NH`, name ="Boston, MA", line =list(color ='violet')) %>%add_lines(y =~`UPT - Seattle--Tacoma, WA`, name ="Seattle, WA", line =list(color ='brown')) %>%add_lines(y =~`UPT - Miami--Fort Lauderdale, FL`, name ="Miami, FL", line =list(color ='cyan')) %>%layout(title ="Public Transit Monthly Ridership by City",xaxis =list(title ="Date"),yaxis =list(title ="Ridership (Total Unlinked Passenger Trips)"),legend =list(title =list(text ="Legend")) )plot
The most obvious aspect of this plot is the great distance between ridership in New York City and all other cities. While New York is certainly the most populous and dense city in the country, this disparity far exceeds that, which tells us that there are characteristics there that are not emulated in other urban areas. The other important thing to note is that all of these cities were greatly impacted in 2020, telling us that the overall trend applies everywhere with no notable exception here. However, the differences in how ridership was trending before and after the pandemic will be important to analyze.
fig <- lyft %>%plot_ly(x =~date,y =~Rides,type ="bar")%>%layout(title =list(text="Quarterly Active Users of Lyft - North America"),xaxis =list(title="Date"),yaxis =list(title="Active Users"))fig
These bar graphs, while they differ in scope due to Uber’s reporting of worldwide data and Lyft’s reporting of North American data, show a similar story to one another. Both services saw significant increases throughout the late 2010s. During the pandemic there was a drop in users just as there was for public transit. However, each service’s active user base has already returned to values close to or above their pre-pandemic values, which is a phenomenon not seen in any of our public transit data.
Vehicle Usage
Code
cars <-read_csv("../data/monthly_miles.csv")cars$observation_date <-as.Date(cars$observation_date, format ="%m/%d/%y")cars <- cars[cars$observation_date >=as.Date("1993-04-01"),]cars <- cars[cars$observation_date <=as.Date("2024-01-01"),]fig <- cars %>%plot_ly(x =~observation_date,y =~M12MTVUSM227NFWA,type ='scatter',mode ='lines',showlegend = F )%>%layout(title =list(text="Monthly Vehicle Miles Driven"),xaxis =list(title="Year"),yaxis =list(title="Vehicle Miles (Millions)"))fig
This plot shows a relatively consistent increase in vehicle miles driven for the majority of years, but with dips in 2008 and 2020. From what we know about these time period, this indicates that this statistic is heavily impacted by economic factors, much like public transit ridership. This will be valuable to include in multivariate analysis.
plot <-plot_ly(remote, x =~as.Date(Month)) %>%add_lines(y =~some_pct, name ="Teleworked Some Hours") %>%add_lines(y =~all_pct, name ="Teleworked All Hours") %>%layout(title ="Plot of Teleworkers over Time",xaxis =list(title ="Date"),yaxis =list(title ="Proportion of Workers 16+"),legend =list(title =list(text ="Legend")) )# Display the plotplot
Code
remote_long <- remote %>%pivot_longer(cols =c(all_pct, some_pct), names_to ="category", values_to ="Proportion of All Workers") %>%mutate(category =recode(category, all_pct ="Teleworked All Hours", some_pct ="Teleworked Some Hours"))fig <-plot_ly(remote_long, x =~Month, y =~`Proportion of All Workers`, color =~category, type ="bar") %>%layout(barmode ="stack", title ="Stacked Bar Chart of Teleworkers",yaxis =list(title="Proportion of All Workers 16+"))fig
As we can see here, the proportion of workers who telework at least a little bit has increased in the last two years. Even as we stray further from the effects of the pandemic, remote work appears to still make up a significant portion of the workforce. However, this rise is almost entirely due to the increase in people who telework some hours, indicating that those arrangements might be the more permanent phenomenon to consider. One additional important thing to note is the time window of this data, which makes it very difficult as exogenous variables when compared with other data. To my knowledge, there is no reliable time series dataset covering this topic that includes data from before, during, or immediately after the pandemic. Thus, this is good to visualize as a way to undertand context of the workforce, but may not be as useful in model fitting.
uber<-getSymbols("UBER",auto.assign =FALSE, from ="2020-01-01") uber=data.frame(uber)uber <-data.frame(uber,rownames(uber))colnames(uber)[7] ="date"uber$date<-as.Date(uber$date,"%Y-%m-%d")close=uber$UBER.Closeuber_ts=ts(close, start=decimal_date(as.Date("2020-01-01")), frequency =252)
Code
lyft<-getSymbols("LYFT",auto.assign =FALSE, from ="2020-01-01") lyft=data.frame(lyft)lyft <-data.frame(lyft,rownames(lyft))colnames(lyft)[7] ="date"lyft$date<-as.Date(lyft$date,"%Y-%m-%d")close=lyft$LYFT.Closelyft_ts=ts(close, start=decimal_date(as.Date("2020-01-01")), frequency =252)
plot <-plot_ly(rideshare_data, x =~Date) %>%add_lines(y =~Uber, name ="Uber Closing Price", line =list(color ='black')) %>%add_lines(y =~Lyft, name ="Lyft Closing Price", line =list(color ='#cc00bb')) %>%layout(title ="Rideshare Stock prices",xaxis =list(title ="Date"),yaxis =list(title ="Price"),legend =list(title =list(text ="Legend")) )# Display the plotplot
This plot tells us that there’s a great disparity in the financial performance of two of the top rideshare service companies in the last three years. Both Uber and Lyft saw sharp dips in stock price during the height of the pandemic, followed by a rise in early 2021, presumably due to the return of commuters requiring transportation again. So far, this matches up well with the rideshare usage plot seen above. However since 2022, Lyft’s stock price has reached values lower than even in 2020 and remained there somewhat consistently, while Uber’s stock price continues an upward trend.
Code
zoom<-getSymbols("ZM",auto.assign =FALSE, from ="2020-01-01") zoom=data.frame(zoom)zoom <-data.frame(zoom,rownames(zoom))colnames(zoom)[7] ="date"zoom$date<-as.Date(zoom$date,"%Y-%m-%d")close=zoom$ZM.Closezoom_ts=ts(close, start=decimal_date(as.Date("2020-01-01")), frequency =252)
Code
microsoft<-getSymbols("MSFT",auto.assign =FALSE, from ="2020-01-01") microsoft=data.frame(microsoft)microsoft <-data.frame(microsoft,rownames(microsoft))colnames(microsoft)[7] ="date"microsoft$date<-as.Date(microsoft$date,"%Y-%m-%d")close=microsoft$MSFT.Closemicrosoft_ts=ts(close, start=decimal_date(as.Date("2020-01-01")), frequency =252)
Code
cisco<-getSymbols("CSCO",auto.assign =FALSE, from ="2020-01-01") cisco=data.frame(cisco)cisco <-data.frame(cisco,rownames(cisco))colnames(cisco)[7] ="date"cisco$date<-as.Date(cisco$date,"%Y-%m-%d")close=cisco$CSCO.Closecisco_ts=ts(close, start=decimal_date(as.Date("2020-01-01")), frequency =252)
plot <-plot_ly(telecom_data, x =~Date) %>%add_lines(y =~Zoom, name ="Zoom Closing Price", line =list(color ='blue')) %>%add_lines(y =~Microsoft, name ="Microsoft Closing Price", line =list(color ='#008800')) %>%add_lines(y =~Cisco, name ="Cisco Closing Price", line =list(color ='#bb0000')) %>%layout(title ="Telecommunications Stock prices",xaxis =list(title ="Date"),yaxis =list(title ="Price"),legend =list(title =list(text ="Legend")) )# Display the plotplot
Somewhat similar to the rideshare plot, there is not much consistency between companies here. While Cisco’s stock price appears to consistently lag behind the others, and Microsoft has an upward trend with several interruptions, Zoom stock tells a very interesting story. Following the pandemic, this price skyrocketed to heights more than double its competitors, only to fall back to being consistenly under 100 dollars by 2022. It is important to note that these companies are not all equally associated with video conferencing in the way that Zoom is. Cisco and Microsoft’s stock are likely more related to their entire set of offerings, explaining why they are not as volatile.
Code
bp<-getSymbols("BP",auto.assign =FALSE, from ="2020-01-01") bp=data.frame(bp)bp <-data.frame(bp,rownames(bp))colnames(bp)[7] ="date"bp$date<-as.Date(bp$date,"%Y-%m-%d")
Code
shell<-getSymbols("SHEL",auto.assign =FALSE, from ="2020-01-01") shell=data.frame(shell)shell <-data.frame(shell,rownames(shell))colnames(shell)[7] ="date"shell$date<-as.Date(shell$date,"%Y-%m-%d")
Code
exxon<-getSymbols("XOM",auto.assign =FALSE, from ="2020-01-01") exxon=data.frame(exxon)exxon <-data.frame(exxon,rownames(exxon))colnames(exxon)[7] ="date"exxon$date<-as.Date(exxon$date,"%Y-%m-%d")
Code
chevron<-getSymbols("CVX",auto.assign =FALSE, from ="2020-01-01") chevron=data.frame(chevron)chevron <-data.frame(chevron,rownames(chevron))colnames(chevron)[7] ="date"chevron$date<-as.Date(chevron$date,"%Y-%m-%d")
plot <-plot_ly(oil_data, x =~Date) %>%add_lines(y =~BP, name ="BP Closing Price", line =list(color ='#66cc00')) %>%add_lines(y =~Shell, name ="Shell Closing Price", line =list(color ='#cccc00')) %>%add_lines(y =~Exxon, name ="Exxon Closing Price", line =list(color ='#bb0000')) %>%add_lines(y =~Chevron, name ="Chevron Closing Price", line =list(color ='#0000bb')) %>%layout(title ="Oil Stock prices",xaxis =list(title ="Date"),yaxis =list(title ="Price"),legend =list(title =list(text ="Legend")) )# Display the plotplot
Unlike the other industries, all four of these stock prices seem to rise and fall with one another. This says a great deal about the homogeneity of the oil industry. In this case, all four prices took a predictable hit in 2020, followed by a generally gradual climb thereafter. There is certainly less volatility here than in other industries.
Gas Prices
Code
gas <-read_csv("../data/gas_prices.csv")gas$Month <-as.Date(paste0(gas$Month, "-01"), format ="%b-%y-%d")gas <- gas[gas$Month >=as.Date("1993-04-01"),]gas <- gas[gas$Month <=as.Date("2024-01-01"),]gas <- gas[order(gas$Month), ]
Code
fig <- gas %>%plot_ly(x =~Month,y =~`Price per Gallon`,type ='scatter',mode ='lines',showlegend = F )%>%layout(title =list(text="Monthly Average Gas Prices"),xaxis =list(title="Year"),yaxis =list(title="Average Price (Dollars)"))fig
This plot of average monthly gas prices will serve as an indicator for the viablity of travel via automobile as an alternative to public transit methods. What we see here is that gas prices often correlate to larger macroeconomic trends, as prices dip during recessions and skyrocket during times of inflation and supply chain mishaps.
Finally, the unemployment rate is clearly heavily tied to larger macroeconomic trends, as well. We see large spikes in 2008 and 2020, times of great recession. Meanwhile, most of the time, unemployment rates gradually decrease in the absence of major economic events. This will provide information on the necessity to use public transit as a means of commuting to work.